Discriminating signal from background using neural networks. 
Application to top-quark search at the Fermilab Tevatron

Ll. Ametller (a), Ll. Garrido (b,c), G. Stimpfl-Abele (a,b), P. Talavera (a) and P. Yepes (d)

(a) Departament de Física i Enginyeria Nuclear, Universitat Politècnica de Catalunya, E-08034 Barcelona, Spain
(b) Departament Estructura i Constituents Matèria, Universitat de Barcelona, E-08028 Barcelona, Spain
(c) Institut de Física d'Altes Energies, Universitat Autònoma de Barcelona, E-08193 Bellaterra (Barcelona), Spain
(d) Rice University, Houston, TX 77251-1892, USA
(February 1, 2008)

Abstract 

The application of Neural Networks in High Energy Physics to the separation
of signal from background events is studied. A variety of problems usually
encountered in this sort of analysis, from variable selection to systematic
errors, are presented. The top-quark search is used as an example to illustrate
the problems and proposed solutions.

It is well known that neural networks (NN's) are useful tools for pattern recognition.
In High Energy Physics, they have been used or proposed as good candidates for
signal versus background classification tasks. However, most of the existing studies are
somewhat academic, in the sense that they essentially compare the performance of NN's with
other classical classification techniques using Monte Carlo (MC) events. In
realistic applications, real events must be analyzed and compared with simulated events,
which introduces systematic effects that have to be taken into account and could significantly
modify the efficiency of the analysis. We try to give some insight in this direction, using the
top quark search at the Fermilab Tevatron as an illustration. The top quark has been observed
by the CDF [?] and D0 collaborations. Recently, NN's have been applied to experimental
top quark searches by the D0 Collaboration [?], for a fixed top quark mass, concluding that
NN's are more efficient than traditional methods, in agreement with previous parton level
studies [?].

In this paper we continue and complete the analysis of Ref. [3] for the top quark search
at the Tevatron. A more realistic study is performed by including parton hadronization and
detector simulation with jet reconstruction. In addition, contrary to Ref. [3], where the top
mass was fixed, the present study is valid for a large range of top mass values. Moreover,
the number of kinematical variables considered is enlarged and different ways of selecting
subsets of the variables most relevant to the process under consideration are discussed. Finally,
the influence of systematic errors on the NN results is studied.

The analysis is focused on the top quark search at the $p\bar{p}$ Fermilab Tevatron operating
at $\sqrt{s} = 1.8$ TeV. The one-charged-lepton channel, $p\bar{p} \to t\bar{t} \to l\nu jjjj$ with $l = e, \mu$, is
considered as the signal to look for. The main background is $p\bar{p} \to Wjjjj \to l\nu jjjj$. Exact
tree-level amplitudes with spin correlations were used to generate MC samples for both
signal and background. The latter was evaluated with VECBOS [?]. The CTEQ structure
functions [?] at the scale $Q = m_t$ ($Q = \langle p_t \rangle$) for the top signal (background) were utilized.
The LUND fragmentation model [?] was used to hadronize the quarks and/or gluons. The

TABLE I. Signal and background cross sections after the acceptance cuts.

of a D0-like calorimeter. Jets are reconstructed with a simple algorithm based on the routine
used in the LUND package and electrons are defined as isolated clusters with more than 90% 
electromagnetic energy. 

Uncorrelated MC signal samples were generated for top masses $m_t = 150$, 168, 174,
189 and 200 GeV. Events with one charged lepton and four jets satisfying the following
acceptance cuts were selected: $p_t^j, p_t^l, \not{p}_t > 20$ GeV; $|\eta^j|, |\eta^l| < 2$ and
$\Delta R_{jl}, \Delta R_{jj} > 0.7$. The symbol $p_t$ ($\eta$) stands for transverse momentum
(pseudorapidity) and the indices $j = 1, \dots, 4$ and $l$ refer to the four jets and the charged
lepton, respectively; $\not{p}_t$ is the missing transverse momentum associated with the
undetected neutrino and $\Delta R = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$ is the
distance in the $\eta$-$\phi$ space, where $\phi$ is the azimuthal angle. The cross sections after the
acceptance cuts for the signal and the background are given in Table I.
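
The acceptance cuts quoted above can be illustrated with a short sketch. It is not the authors' selection code: the function names, the `(pt, eta, phi)` tuple layout and the default thresholds passed as parameters are assumptions made for illustration, and the pseudorapidity cut is applied to both jets and lepton.

```python
import math

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2) % (2.0 * math.pi)
    if dphi > math.pi:
        dphi -= 2.0 * math.pi
    return dphi

def delta_r(eta1, phi1, eta2, phi2):
    """Distance in eta-phi space: sqrt(d_eta^2 + d_phi^2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))

def passes_acceptance(jets, lepton, missing_pt,
                      pt_min=20.0, eta_max=2.0, dr_min=0.7):
    """jets: list of four (pt, eta, phi) tuples; lepton: one (pt, eta, phi).

    Hypothetical event layout; momenta in GeV.
    """
    if len(jets) != 4 or missing_pt <= pt_min:
        return False
    objects = list(jets) + [lepton]
    # Transverse momentum and pseudorapidity cuts on jets and lepton.
    if any(pt <= pt_min or abs(eta) >= eta_max for pt, eta, _ in objects):
        return False
    # Separation cuts: every jet-jet and jet-lepton pair.
    for a in range(len(objects)):
        for b in range(a + 1, len(objects)):
            _, eta_a, phi_a = objects[a]
            _, eta_b, phi_b = objects[b]
            if delta_r(eta_a, phi_a, eta_b, phi_b) <= dr_min:
                return False
    return True
```

Wrapping the azimuthal difference matters for objects on either side of the $\phi = \pm\pi$ boundary, which would otherwise appear far apart.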

In order to use NN's as signal/background classifiers, we considered layered feed-forward
NN's with topologies $N_i \times N_h \times N_o$ ($N_i$, $N_h$ and $N_o$ are the numbers of input, hidden and
output neurons, respectively), with back-propagation as the learning algorithm to minimize
a quadratic output error. Using a set of physical variables as inputs and taking the desired
output as 1 for signal events and 0 for background events, the network output gives, after
learning, the conditional probability that new test events are of signal or background type,
provided that the signal/background ratio used in the learning phase corresponds to
the real one.
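
As a minimal sketch of the kind of classifier described above, the following pure-Python code trains an $N_i \times N_h \times 1$ feed-forward network by back-propagation on the quadratic error $\frac{1}{2}(o - t)^2$, with target $t = 1$ for signal and $t = 0$ for background. The sigmoid activation, the learning rate, the weight initialisation and the toy Gaussian "events" are assumptions chosen for illustration; they are not taken from the analysis.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class FeedForwardNet:
    """N_i x N_h x 1 network with sigmoid units (illustrative only)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # w_hidden[i][k]: weight from input k to hidden unit i; last entry is a bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        inputs = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        """One on-line back-propagation update; returns the quadratic error."""
        inputs = list(x) + [1.0]
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden-layer update uses the output weights before they change.
        for i, row in enumerate(self.w_hidden):
            delta_h = delta_out * self.w_out[i] * hidden[i] * (1.0 - hidden[i])
            for k, v in enumerate(inputs):
                row[k] -= lr * delta_h * v
        for i, h in enumerate(hidden + [1.0]):
            self.w_out[i] -= lr * delta_out * h
        return 0.5 * (out - target) ** 2

def mean_error(net, data):
    """Average quadratic error per event."""
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)

def make_toy_events(n_pairs, rng):
    """Hypothetical 1:1 sample: signal near (1, 1), background near (-1, -1)."""
    data = []
    for _ in range(n_pairs):
        data.append(([rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)], 1.0))
        data.append(([rng.gauss(-1.0, 0.5), rng.gauss(-1.0, 0.5)], 0.0))
    return data
```

Calling `train_step` repeatedly over a sample produced by `make_toy_events` and then comparing `mean_error` before and after training shows the error decreasing; the quantity returned by `mean_error` plays the role of the average error per event used in the text.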

The robustness of the NN method is shown by making the results independent of the top
mass, using several values in the learning and testing phases. During the learning phase a
general network (GN) is fed with a set of events which contains a signal sample, composed
of three subsamples corresponding to $m_t = 150$, 174 and 200 GeV, and a background sample
in a 1:1 proportion. In so doing, the NN output loses its direct Bayesian interpretation
when applied to data whose signal/background proportion is not 1:1. Nevertheless, the
NN is still useful for classification [?]. This way of proceeding has been shown to optimize
the learning process and allows the network to be used in a wide interval for the masses of the

A set of $N = 15$ initial variables was considered. Some of them are chosen specifically
to pin down the a priori main characteristics of the top signal, while others are not specific
to the signal. For each reconstructed event we compute: (1) $S$, the sphericity; (2) $A$,
the aplanarity; (3) $m_W^{jj}$, the invariant mass of the hadronically decaying $W$; (4) $p_t^{W_l}$, the
transverse momentum of the leptonically decaying $W$; (5) $E_t$, the total transverse energy;
(6) $p_t^l$, the charged lepton transverse momentum; (7) $\eta^l$, the charged lepton pseudorapidity;
(8-11) $p_t^{j_i}$, $i = 1, \dots, 4$, the transverse momenta of the jets in decreasing order and
(12-15) $\eta^{j_i}$, $i = 1, \dots, 4$, the jet pseudorapidities in decreasing order. The missing
transverse momentum has been assigned to the undetectable neutrino and its longitudinal
momentum inferred along the lines suggested in Ref. [11].



In the testing phase, the GN with topology $15 \times 15 \times 1$ is fed with new background and
top events. The latter can be chosen with masses either corresponding to the values used
for learning or to new values, $m_t = 168$ or 189 GeV. This differs from previous works [12,4],
where the same mass values were used in both learning and testing steps. Figure 1 shows
the reconstructed top mass obtained for five top signals and the background, corresponding
to an integrated luminosity $\mathcal{L} = 100$ pb$^{-1}$. A good top reconstruction is achieved for all
masses considered, but there is a substantial background contribution. To further appreciate
the GN's usefulness, five specialized NN's (SN) were trained, each with its own specific top
mass and a generic background common to all of them. Again, a 1:1 signal to
background ratio was used for learning. The GN and SN average errors, shown in Table II,
are similar for all masses considered. This indicates that the GN performs fairly well for a

TABLE II. Average error per event. The asterisks indicate the top mass values used in the
General Network training.

FIG. 1. Reconstructed top mass distribution for several top signals and the background for
$\mathcal{L} = 100$ pb$^{-1}$.

Nevertheless, it is clear that the window for the top mass should be reduced if the mass is 
more precisely known. 

As a complementary check to the present analysis, we have passed the first top candidates
published by CDF [13] through our initial $15 \times 15 \times 1$ network in order to see whether
they are compatible with our simulated signal and/or background. Although our NN was
trained with the simulation of the D0 detector, such a check is still valid, since CDF quotes
the parton level momenta assigned to their top candidates. One can therefore process those
events through our D0 detector simulation, reconstruct the variables used in our analysis
and obtain the individual output for the published CDF top candidates. The results are
shown in Table III. It can be seen that most of them give values close to 1, showing that
they are more compatible with our signal simulation than with our simulated background.

TABLE III. NN output for published CDF events.

The selection of the most relevant variables for a given process is one of the major 
problems in experimental analyses. Too many variables may introduce noise and make 
the event selection task very difficult. On the other hand, too much sensitivity may be 
lost when too few variables are used. In general, a large number of variables, $N$, can be
considered and measured for an event. All $N$ variables carry some information on signal
versus background differences, but it is obvious that some subsets of them will be more
valuable than others for the separation task. Therefore the selection of a subset with
the $n$ 'best' variables ($n < N$), carrying the largest discrimination power between signal and
background samples, even if lower classification efficiencies may follow, is of interest.

In the process of reducing the number of variables, it is convenient to control the efficiency
loss in the classification task. We suggest that NN's can be used for both the variable
selection and the evaluation of the efficiency loss. For the former, there are several methods

latter will naturally be estimated in terms of the error function. When reducing the number
of variables, it is convenient to eliminate only a few variables in one step rather than
rejecting many variables at once. This introduces a mild dependence of the chosen variables
on the number of rejection steps, but turns out to be more efficient. The following approach
was adopted:

• Step 1: An $N \times N \times 1$ network is trained with the initial $N = 15$ variables and its
final error is computed, $E_N = E_0$.

• Step 2: A particular variable selection method is applied, rejecting $n$ (keeping $N - n$)
variables. (It is convenient to choose small values for $n$.)

• Step 3: A new $(N - n) \times (N - n) \times 1$ network is trained with the $N - n$ variables
kept and its final error is computed, $E_{N-n}$. If the quantity $E_0/E_{N-n}$ is larger than, for
instance, 75%, step 2 is repeated (replacing $N$ by $N - n$) to further reduce the set of
relevant variables. The algorithm stops if $E_0/E_{N-n} < 0.75$. This cut is arbitrary and
the number of selected variables depends on it.
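
The three steps above can be sketched as a loop. This is an illustration under stated assumptions, not the code used for the analysis: `train` is a hypothetical callable that trains a network on the given variables and returns its final error together with one relevance value per variable, and `sum_abs_weights` is one possible relevance measure, the sum of absolute input-to-hidden weights.

```python
def sum_abs_weights(w_hidden):
    """Relevance W_k = sum_i |w_ki| for each input k.

    w_hidden[i][k] is assumed to be the weight from input k to hidden unit i.
    """
    n_in = len(w_hidden[0])
    return [sum(abs(row[k]) for row in w_hidden) for k in range(n_in)]

def reduce_variables(variables, train, n_reject=1, min_ratio=0.75):
    """Iteratively reject the least relevant variables.

    train(variables) -> (final_error, relevance), where relevance is a list
    aligned with `variables` (hypothetical interface).
    Returns the last accepted variable set and the (set, error) history.
    """
    kept = list(variables)
    e0, relevance = train(kept)                      # Step 1: reference error E_0
    history = [(tuple(kept), e0)]
    while len(kept) > n_reject:
        # Step 2: reject the n_reject variables with the smallest relevance.
        order = sorted(range(len(kept)), key=lambda k: relevance[k])
        rejected = set(order[:n_reject])
        candidate = [v for k, v in enumerate(kept) if k not in rejected]
        # Step 3: retrain on the reduced set and compare errors.
        error, candidate_relevance = train(candidate)
        if e0 / error < min_ratio:
            break                                    # error grew too much: stop
        kept, relevance = candidate, candidate_relevance
        history.append((tuple(kept), error))
    return kept, history
```

Keeping `n_reject` small mirrors the remark in Step 2; the returned set is the last one whose error ratio still satisfied the threshold.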

We have considered three methods involving weights for the selection of the variables
carried out at step 2. For every input neuron $k$, the following quantities, in terms of its
connections $w_{ki}$ with the hidden layer units, have been considered: the sum of the weights [?],
the variances [14] and the saliencies, defined respectively as

Method 1: $W_k = \sum_i |w_{ki}|$

Method 2: $\mathrm{Var}(k) = \ldots$

The surviving sets of relevant variables with error increase up to 25% are: 3, 5, 8, 10, 11 for
Methods 1 and 3, and 3, 8, 10, 11, 12, 15 for Method 2. The associated output-error turns

FIG. 2. The statistical significance as a function of the cut on the NN output. The symbols on
the curves and the vertical line indicate the maximum network output cuts such that more than
five signal and five background events survive, respectively.

output-error, which corresponds to Methods 1 and 3, can be safely chosen. The relevant
variables are the mass of the hadronically decaying $W$, the total transverse energy $E_t$, and
the jet transverse momenta $p_t^{j_1}$, $p_t^{j_3}$ and $p_t^{j_4}$. The quadratic error associated with this set of
five variables, obtained through systematic reduction, can be compared, for instance, with
the one obtained for the intuitive variables used in Ref. [?]: $S$, $A$, $m_W^{jj}$, $p_t^l$, $E_t$. The former
is 18% lower than the latter, showing the usefulness of the methodical reduction.

We have trained an NN with the five relevant variables to study the enhancement of
the signal/background ratio as a function of the NN output cut. For a specific cut, only
events with a network output higher than the specified cut are selected. Since the signal
is peaked around 1 and the background around 0, it is clear that increasing the cut makes
the signal/background ratio larger. A typical quantity that is used to reveal the existence
of a signal is the statistical significance, defined as $S_s = N_s/\sqrt{N_b}$, where $N_s$ ($N_b$) is the
number of signal (background) events passing some NN output cut. It is assumed that $N_b$
can be estimated with negligible error, but $N_s$ should be obtained from the actual number
of observed events, $N_o$, as $N_s = N_o - N_b$. If both quantities $N_b$ and $N_s$ are large enough
($> 5$), $S_s$ can be interpreted as the number of standard deviations that the background has
to fluctuate to obtain the observed number of events. In such a case, the number of signal
events is also given by $N_s = N_o - N_b \pm \sqrt{N_b}$.

FIG. 3. Reconstructed top mass distribution for several top mass signals and the background,
for events with outputs larger than 0.7 and $\mathcal{L} = 100$ pb$^{-1}$.

Figure 2 shows $S_s$ for $m_t = 168$, 174 and 189 GeV and $\mathcal{L} = 100$ pb$^{-1}$. Conservative
limits of validity are shown in the figure. The vertical line at network outputs $\sim 0.8$ indicates
the maximum network output cut such that $N_b > 5$. In a similar way, the symbols on the
curves indicate the maximum output cut such that more than five signal events still survive.
NN output cuts between 0.6 and 0.8 increase the signal/background ratio with a minimal
loss of signal and a significant loss of background. Figure 3 shows the reconstructed
top mass with only those events with an NN output larger than 0.7. As can be observed,
the signals dominate clearly over the background.
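
A scan of the statistical significance over NN output cuts, with the validity condition that more than five signal and five background events survive, might be sketched as follows. The per-event weights `w_signal` and `w_background`, which would rescale MC counts to the expected number of events for a given luminosity, are hypothetical parameters introduced for illustration.

```python
import math

def significance(n_signal, n_background):
    """Statistical significance S_s = N_s / sqrt(N_b)."""
    return n_signal / math.sqrt(n_background)

def scan_cuts(signal_outputs, background_outputs, cuts,
              w_signal=1.0, w_background=1.0, min_events=5.0):
    """Return (cut, N_s, N_b, S_s, valid) for each NN output cut.

    `valid` is True only when both N_s and N_b exceed min_events,
    the regime in which S_s can be read as a number of standard deviations.
    """
    rows = []
    for cut in cuts:
        n_s = w_signal * sum(1 for o in signal_outputs if o > cut)
        n_b = w_background * sum(1 for o in background_outputs if o > cut)
        s_s = significance(n_s, n_b) if n_b > 0 else float("inf")
        valid = n_s > min_events and n_b > min_events
        rows.append((cut, n_s, n_b, s_s, valid))
    return rows
```

Rows flagged as not valid correspond to cuts beyond the conservative limits described above and should not be interpreted in terms of Gaussian standard deviations.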

At this point, one can wonder about the benefits of using a reduced number of variables
in the analysis. The main reason is to avoid possible noise when a large number of variables
is used. In fact, the allowed increase of 25% for the average error translates into decreases in
the signal efficiency and statistical significance. We have found that the efficiency (statistical
significance) diminishes from 0.75 (6.8) to 0.58 (6.0) when reducing from the initial 15 to
the final 5 variables, for an NN output cut of 0.7, a value chosen because it maximizes the
statistical significance. These can be considered dramatic losses. However, our initial number
of variables, $N = 15$, was moderate and we could optimize the NN learning, avoiding local
minima. In general, this can be done for small sets of variables, but it is very difficult for
large ones, so it is possible that NN's trained with small subsets of relevant variables
reach better efficiencies and/or statistical significances than NN's trained with larger variable
sets.

We now consider some sources of systematic errors coming from possible disagreements
between MC and real data. In standard analyses, where single cuts are applied on single
variables, the effects of systematic errors need to be studied only in the region around the
cuts, in an easy and well understood way. In the case of an NN, the only possibility to study
the systematic error in the classification is to propagate the "estimated" systematic errors
on the input variables to the output. Two basic effects can be considered: shifts between
data and MC, and different resolutions for the variables used. We have studied the effect of
2% shifts and a 2% change of resolution on the cluster energies. With these new energies the
five selected variables were reconstructed to obtain a "new" test data set to evaluate systematic
effects. Notice that the 2% variation of the reconstructed cluster energies has been chosen
for illustration purposes. This procedure automatically includes the correlations of the NN
input variables. (There are studies in the literature where this is not the case [16].) The
results depend on the NN output cut. In the region of interest, we have found that the
uncertainty due to systematic errors is comparable with the uncertainty coming from an
error on $m_t$ of $\pm 11$ GeV.
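
The propagation procedure described above, varying the cluster energies, reconstructing the input variables again and re-evaluating the network, can be sketched as follows. Everything here is a hypothetical interface: `reconstruct` stands for whatever maps cluster energies to the NN inputs, `nn_output` for the trained network, and modelling the resolution change as an extra multiplicative Gaussian smearing is an assumption of this sketch.

```python
import random

def shift_energies(clusters, shift=0.02):
    """Scale every cluster energy by (1 + shift)."""
    return [e * (1.0 + shift) for e in clusters]

def smear_energies(clusters, extra_resolution=0.02, rng=None):
    """Apply an additional Gaussian smearing to every cluster energy."""
    rng = rng or random.Random(0)
    return [e * rng.gauss(1.0, extra_resolution) for e in clusters]

def efficiency(events, reconstruct, nn_output, cut, transform=lambda c: c):
    """Fraction of events whose NN output exceeds `cut` after `transform`.

    events: list of cluster-energy lists (hypothetical event format).
    """
    passed = sum(1 for clusters in events
                 if nn_output(reconstruct(transform(clusters))) > cut)
    return passed / len(events)

def shift_systematic(events, reconstruct, nn_output, cut, shift=0.02):
    """Nominal efficiency and its change under +shift and -shift."""
    nominal = efficiency(events, reconstruct, nn_output, cut)
    up = efficiency(events, reconstruct, nn_output, cut,
                    transform=lambda c: shift_energies(c, +shift))
    down = efficiency(events, reconstruct, nn_output, cut,
                      transform=lambda c: shift_energies(c, -shift))
    return nominal, up - nominal, down - nominal
```

Because the variation is applied to the cluster energies before the variables are rebuilt, correlations among the NN inputs are carried through automatically, which is the point made in the text.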

The application of Neural Networks to discriminate signal from background in High
Energy Physics has been studied, using the top quark search at Fermilab as an example.
The analysis is valid for a large range of top mass values. Special attention was paid to
the selection of the most relevant variables. Several methods, in terms of the weights
connecting the input and the hidden neurons, were considered. We conclude that Methods
1 and 3, making use of the sum of the weights (in absolute value) and the weight saliencies,
respectively, give similar results and are better suited for the variable selection than Method 2,
which uses the weight variances. The performance of the reduced NN was studied in terms of the
statistical significance. When comparing it with the initial NN, we found a small decrease
in the statistical significance and a moderate loss of signal efficiency. Finally, the effect
of propagating systematic errors arising from energy shifts and changes in resolution have